library(roahd)
Functional Data Analysis is one of the core research areas of our department, you have already seen during class that some of the state-of-the-art techniques were actually developed here. Depth-based techniques find a natural application in Functional Statistics. I assume that many of you already know what we mean by functional data (applied statistics course docet), nevertheless do not worry if you do not: I have structured this tutorial as a very basic primer to understand how to do some methodological and applied work in the framework of functional data analysis (FDA).
According to the FDA model, data can be seen as measurements of a quantity/quantities along a given, independent and continuous indexing variable (such as time or space). Observations are then treated as random functions and can be viewed as trajectories of stochastic processes defined on a given infinite dimensional functional space.
A natural way to reason (and actually comprehend why they need specific and peculiar techniques) is to think about them as relatively dense longitudinal data: unlike tabular data, functional data are not invariant to permutations of their dimensions (i.e., if I switch two columns of a standard multivariate dataset no harm is done, if I do it with a functional one, I will create a bloody mess.)
Given its ubiquitous presence in applications FDA has been a very
active research field in the past decades, with a plethora of R packages
developed for helping practitioner efficiently and effectively employ
such techniques. In what follows we will focus our attention on the
roahd package (developed by members and former members of
our Department of Mathematics): a package meant to collect and provide
methods for the analysis of univariate and multivariate functional
datasets through the use of robust methods. First off we will dedicate
some time in understanding how to represent these complex infinite
dimensional objects (our computers only deal with finite approximations)
and how to simulate functional data. We will then focus on computation
of depths and outlier detection. This lab WILL NOT provide a thorough
treatment of FDA, but we will limit to study some introductory concepts
on how to deal with univariate (and bivariate, only very briefly)
functional datasets.
The way roahd package represents functional objects is
by providing an evenly spaced grid \(I=\left[t_{0}, t_{1}, \ldots,
t_{P-1}\right]\) \(\left(t_{j}-t_{j-1}=h>0, \: \forall j=1,
\ldots, P-1\right)\) over which the functional observations \(D_{i, j}=X_{i}\left(t_{j}\right)\) \(\forall i=1, \ldots, N\) and \(\forall j=0, \ldots, P-1\) are measured.
This is very conveniently handled by the fData object
class. In particular, the following model is considered for the
generation of data: \[
X(t)=m(t)+\varepsilon(t), \text { for all } t \text { in } I
\] where \(m(t)\) is the center
and \(\varepsilon(t)\) is a centered
gaussian process with covariance function \(C(\cdot,\cdot)\). That is to say: \[
\operatorname{Cov}(\varepsilon(\mathrm{s}), \varepsilon(t))=C(s, t),
\text { with } s, t \text { in } I
\] The employed structure for \(C(s,
t)\) is the Exponential covariance function:
\[ C(s, t)=\alpha e^{-\beta|s-t|} \]
P <- 101
grid <- seq( 0, 1, length.out = P)
alpha <- 0.2
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
C_st contains a \(P \times
P\) matrix of values.
image( C_st,
main = 'Exponential covariance function',
xlab = 'grid', ylab = 'grid')
After having defined the mean function \(m(t)\)
m <- sin(pi*grid)+sin(2*pi*grid)
we are ready to generate functional data as follows
n <- 100
set.seed(26111992)
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
The output of the previous chunk is actually a \(n \times P\) matrix, where \(P\) is equal to the length of the grid. That is, instead of having a “proper” functional datum, I have its evaluation on a relatively fine grid (this is actually the best we can do). We can plot it with:
matplot(grid,t(data), type="l", col=adjustcolor(col=1,alpha.f = .4))
lines(grid,m, col="blue", lwd=5)
Or even better we exploit the potential of the roahd
package by constructing a functional object:
f_data <- fData(grid,data)
plot(f_data) # what happens if I do plot(data)?
lines(grid,m, col="black", lwd=5)
The first command defines an object of class fdata
class(f_data)
## [1] "fData"
with dedicated methods for plots, the four basic algebraic
operations, subsetting and much more. In particular, using objects of
class fdata makes the computation of depth measures and
related quantities much easier (see next section). For a thorough
account on the potential of the roahd package, check this
paper.
By changing the hyperparameters \(\alpha\) and \(\beta\) in the Exponential covariance function, we can generate functional datum with different degree of dependence and variability. In details, \(\alpha\) controls the overall level of variability in the signal, while the parameter \(\beta\) affects the autocorrelation length of the signal’s noise, with lower values of \(\beta\) leading to wider correlation lengths and vice-versa
alpha <- 1
beta <- 0.2
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High overall level of variability")
alpha <- .1
beta <- 0.0001
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="High smothness")
alpha <- .1
beta <- 100
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
plot(f_data, main="Virtually uncorrelated signals")
Let us now consider the data you have seen in the Case Study: ECG
signals. The registered and smoothed signals are contained in the
mfD_healthy object.
data("mfD_healthy") #
univariate_fdata <- mfD_healthy$fDList[[1]] # I consider the first lead only
plot(univariate_fdata)
With roadhd it is very easy to compute Band depths and
Modified band depths for a given fdata object:
band_depth <- BD(Data = univariate_fdata)
modified_band_depth <- MBD(Data = univariate_fdata)
We can compute the median curve
median_curve <- median_fData(fData = univariate_fdata, type = "MBD") # still an fData object
Or manually by firstly computing the Modified Band Depth and then identifying the curve with max MBD
median_curve_manual <- univariate_fdata[which.max(modified_band_depth),] # still an fData object
all(median_curve_manual$values==median_curve$values)
## [1] TRUE
plot(univariate_fdata)
grid_ecg <- seq(median_curve_manual$t0,median_curve_manual$tP,by=median_curve_manual$h)
lines(grid_ecg,median_curve_manual$values)
Epigraph Index (EI) and Hypograph Index (HI) or their corresponding
Modified versions (MEI and MHI) for providing down-upward/up-downward
order of data can also be computed: see functions EI,
HI, MEI, MHI. As an example, let
us compute the Spearman’s correlation index between the first and second
lead of the ECG signals. Recall that for a bivariate functional dataset
\([\mathbf{x} \mathbf{y}]\)
\[ \hat{\rho}_{s}(\mathbf{x}, \mathbf{y})=\hat{\rho}_{p}\left(I L_{n}-\operatorname{grade}(\mathbf{x}), I L_{n}-\operatorname{grade}(\mathbf{y})\right) \] where \(\hat{\rho}_{p}\) is the sample Pearson correlation coefficient and \[ \begin{aligned} &I L_{n}-\operatorname{grade}(\mathbf{x})=\left(I L_{n}-\operatorname{grade}\left(x_{1}\right), I L_{n}-\operatorname{grade}\left(x_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(x_{n}\right)\right) \\ &I L_{n}-\operatorname{grade}(\mathbf{y})=\left(I L_{n}-\operatorname{grade}\left(y_{1}\right), I L_{n}-\operatorname{grade}\left(y_{2}\right), \ldots, I L_{n}-\operatorname{grade}\left(y_{n}\right)\right) . \end{aligned} \] where the inferior length grade \(I L_{n}-\operatorname{grade}(x_i)\) can be interpreted as the “proportion of time” that the sample \(x_1,\ldots,x_n\) is smaller than \(x_i\), that is it defines the relative position of a curve with respect to the sample.
The Spearman’s correlation index is immediately obtained via the
cor_spearman function
bivariate_data <- as.mfData(list(mfD_healthy$fDList[[1]], mfD_healthy$fDList[[2]]))
plot(bivariate_data)
cor_spearman(bivariate_data, ordering='MHI')
## [1] 0.6811366
Or actually we can manually compute it:
MHI_first_lead <- MHI(bivariate_data$fDList[[1]])
MHI_second_lead <- MHI(bivariate_data$fDList[[2]])
cor(MHI_first_lead, MHI_second_lead)
## [1] 0.6811366
Coding tip: never forget the power of functional programming!
do.call(args = lapply(1:2, function(ind)
MHI(bivariate_data$fDList[[ind]])), what = "cor")
## [1] 0.6811366
We conclude this lab by looking at some graphical tools for identyfing outliers in a functional sample. Consider the following univariate functional data
alpha <- 0.2
beta <- 0.002
C_st <- exp_cov_function( grid, alpha, beta )
data <- generate_gauss_fdata(N = n,centerline = m,Cov=C_st)
f_data <- fData(grid,data)
We consider two cases: first a dataset with some magnitude outliers, obtained inflating the last 10 curves by a number generated from a uniform distribution in \([2,3]\)
set.seed(33) # reproducibility
outlier_share <- .1
n_outliers <- n*outlier_share
out_highlighter <- rep(c(1,2),c(n-n_outliers,n_outliers))
f_data_temp <- f_data[1:(n*(1-outlier_share)),] # Coding tip: subsetting is mabe possible by the S3 class fdata
mag_temp <- f_data[(n*(1-outlier_share)+1):n,] * runif(10,2,3)
f_data_mag <- append_fData(f_data_temp,mag_temp)
plot(f_data_mag, col=out_highlighter)
And then some shape outliers by shifting the generating mechanism by
the quantity shift_q:
shift_q <- .5
mu_warp=mu=sin(pi*grid+shift_q)+sin(2*pi*grid+shift_q)
shape_temp=generate_gauss_fdata(N = n_outliers, mu_warp, Cov=C_st)
shape_temp=fData(grid,shape_temp)
f_data_shape=append_fData(f_data_temp,shape_temp)
plot(f_data_shape, col=out_highlighter)
Now let us see whether we can employ the two plotting tools we have seen in class, namely functional boxplots and outliegrams to detect magnitude and shape outliers, respectively.
invisible(fbplot(f_data_mag, main="Magnitude outliers"))
invisible(outliergram(f_data_mag))
I used invisible() in the previous chunk to avoid the
output of the call to be printed on the console (and in the knitted html
file). If I do not use it I obtain the following:
fbplot(f_data_shape, main="Shape outliers")
## $Depth
## [1] 0.11941794 0.47258526 0.32503850 0.50875688 0.51202120 0.42418442
## [7] 0.05471947 0.44966697 0.34153615 0.41094309 0.49388739 0.25033103
## [13] 0.46148415 0.51209321 0.39750175 0.04440644 0.16522252 0.15912591
## [19] 0.51308931 0.35234923 0.41615962 0.25932393 0.49001100 0.50787679
## [25] 0.35572957 0.32413041 0.18964096 0.21727573 0.18282428 0.09815182
## [31] 0.45249925 0.44432643 0.25574757 0.39053305 0.44882288 0.49020702
## [37] 0.50445645 0.22153215 0.40049405 0.14020402 0.31457346 0.51088509
## [43] 0.48876288 0.44612261 0.21853985 0.15872187 0.32537054 0.50789279
## [49] 0.50629263 0.47253325 0.48965097 0.19740574 0.50759276 0.51214521
## [55] 0.43428943 0.46780878 0.02000000 0.30044804 0.29381538 0.47728973
## [61] 0.07658566 0.51019702 0.08822682 0.37105511 0.48971897 0.46210421
## [67] 0.46548055 0.34415642 0.27434543 0.48607861 0.30566457 0.35639364
## [73] 0.50384438 0.12681068 0.48860286 0.41711571 0.50476048 0.36390239
## [79] 0.48821482 0.23406941 0.37900390 0.10801280 0.42349635 0.02674867
## [85] 0.50020802 0.40249825 0.16453445 0.25549155 0.29016702 0.06704070
## [91] 0.38340834 0.27212521 0.36185019 0.37102310 0.21195920 0.37791579
## [97] 0.36079408 0.18767277 0.38006801 0.37687169
##
## $Fvalue
## [1] 1.5
##
## $ID_outliers
## [1] 57
outliergram(f_data_shape)
## $Fvalue
## [1] 1.5
##
## $d
## [1] 1.207844e-03 3.293399e-03 1.788100e-03 2.606875e-03 2.973406e-03
## [6] 2.353622e-03 6.641456e-04 1.843709e-03 9.445697e-04 1.423826e-03
## [11] 3.408460e-03 2.778773e-03 3.102488e-03 2.845790e-03 1.662067e-03
## [16] 3.659772e-05 3.524313e-04 1.440025e-03 1.899754e-03 2.638600e-03
## [21] 2.334570e-03 1.505972e-03 2.209369e-03 1.716607e-03 6.064369e-04
## [26] 3.671654e-03 1.726549e-03 1.964791e-03 1.459829e-03 1.160235e-03
## [31] 2.783249e-03 1.235807e-03 5.213987e-04 2.320628e-03 2.403805e-03
## [36] 3.966139e-03 1.903874e-03 1.370513e-03 2.375525e-03 1.346630e-03
## [41] 3.234937e-03 3.455316e-03 3.201904e-03 2.374574e-03 2.412915e-04
## [46] 3.344097e-04 7.228050e-04 2.146314e-03 3.383229e-03 2.054542e-03
## [51] 4.073001e-03 1.761364e-04 2.326688e-03 2.406221e-03 2.324312e-03
## [56] 2.424124e-03 -4.267420e-16 3.873259e-04 5.901580e-04 3.073020e-03
## [61] 8.517683e-04 3.343186e-03 1.261116e-04 1.014121e-03 1.513894e-03
## [66] 2.671832e-03 1.909736e-03 8.865837e-04 1.715776e-03 1.881218e-03
## [71] 1.248838e-03 9.497583e-04 3.332809e-03 1.524509e-04 2.803251e-03
## [76] 1.553264e-03 3.588953e-03 2.145957e-03 1.743065e-03 4.286369e-04
## [81] 2.378654e-03 1.434203e-04 1.359542e-03 1.577584e-04 2.259632e-03
## [86] 1.311814e-03 2.081396e-04 2.044561e-04 1.985545e-03 9.707901e-05
## [91] 1.198346e-01 1.259825e-01 9.742135e-02 1.002164e-01 1.050986e-01
## [96] 1.337795e-01 1.378603e-01 1.794469e-02 1.348418e-01 1.147931e-01
##
## $ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100
We appreciate how the shape outliers are for the entirely missed by the functional boxplot, whereas the outliergram effectively uncovers them.
As usual, saving the output of the plots to an object allows to recover the ID of the identified outliers:
out_shape <- outliergram(f_data_shape, display = FALSE)
out_shape$ID_outliers
## [1] 91 92 93 94 95 96 97 98 99 100
The outliergram is actually based on some simple manipulations of MEI and MBD, that can be hard coded:
MEI_out_shape <- MEI(f_data_shape)
MBD_out_shape <- MBD(f_data_shape)
a_0 <- a_2 <- -2/(f_data_shape$N*(f_data_shape$N-1))
a_1 <- 2*(f_data_shape$N+1)/(f_data_shape$N-1)
d_manual <- a_0+a_1*MEI_out_shape+a_2*f_data_shape$N^2*MEI_out_shape^2-MBD_out_shape
ID_outliers_manual <- which(d_manual>quantile(d_manual,probs = .75)+out_shape$Fvalue*IQR(x = d_manual))
Let us check that our manual computations are correct:
all(dplyr::near(d_manual, out_shape$d))
## [1] TRUE
all(ID_outliers_manual==out_shape$ID_outliers)
## [1] TRUE
One last comment: we have seen in class that the value \(F\) used to build the fences in the
functional boxplot can be adjusted. Despite the algorithmic
implementation/ mathematical theory behind it being quite tedious, the
automated selection of \(F\) via the
roahd package is actually pretty easy:
set.seed(22)
fbplot(f_data_mag, main="Magnitude outliers",adjust = list(VERBOSE=FALSE))
## $Depth
## [1] 0.17158716 0.48351435 0.36043404 0.51197720 0.51346535 0.44345035
## [7] 0.11140114 0.44828283 0.37454345 0.41143114 0.50079208 0.29340334
## [13] 0.47406541 0.51069707 0.39913391 0.07915392 0.19287329 0.20865087
## [19] 0.51244124 0.38464846 0.43680168 0.30039204 0.49752775 0.50558056
## [25] 0.36416642 0.35985799 0.23734973 0.26231223 0.23089309 0.15360136
## [31] 0.44980698 0.44241424 0.27598960 0.41755976 0.46359236 0.49864386
## [37] 0.50922892 0.26684068 0.42570057 0.18997300 0.35127313 0.51342134
## [43] 0.49773177 0.46192419 0.24127413 0.18600060 0.33606361 0.51169317
## [49] 0.50508051 0.46992899 0.48732673 0.22400840 0.50558056 0.51399340
## [55] 0.45174317 0.46573257 0.04924692 0.31222522 0.30632063 0.48883488
## [61] 0.13333533 0.51051305 0.11827383 0.37881988 0.48634263 0.46022802
## [67] 0.47826583 0.35446945 0.31427743 0.48341834 0.31714971 0.36421842
## [73] 0.50850485 0.15533353 0.48600260 0.41882788 0.50191619 0.39486149
## [79] 0.49565157 0.25567157 0.40775478 0.13715172 0.42486449 0.08306631
## [85] 0.49748375 0.40545055 0.19164516 0.27580158 0.32959496 0.09931193
## [91] 0.18090009 0.22688469 0.17876788 0.14172017 0.02582458 0.21552755
## [97] 0.14330033 0.20568257 0.25305131 0.07321732
##
## $Fvalue
## [1] 1.545459
##
## $ID_outliers
## [1] 91 92 93 94 95 96 97 98 100
outliergram(f_data_mag,adjust = list(VERBOSE=FALSE))
## $Fvalue
## [1] 4.867737
##
## $d
## [1] 1.269870e-03 2.801349e-03 1.347818e-03 1.719855e-03 1.165661e-03
## [6] 1.591763e-03 1.409289e-03 1.082484e-03 1.435708e-03 2.820559e-03
## [11] 2.784199e-03 1.662661e-03 2.588734e-03 3.433730e-03 2.127738e-03
## [16] 1.071711e-03 5.723741e-04 1.179009e-03 2.221489e-03 2.030381e-03
## [21] 1.458126e-03 1.613983e-03 2.153483e-03 1.013606e-03 1.881931e-03
## [26] 2.584179e-03 1.602418e-03 1.770791e-03 1.065136e-03 1.244600e-03
## [31] 2.973802e-03 1.506765e-03 1.308210e-03 1.632639e-03 1.875118e-03
## [36] 2.498666e-03 1.236718e-03 1.034480e-03 1.672603e-03 1.234381e-03
## [41] 1.908270e-03 1.560235e-03 1.704646e-03 1.198496e-03 1.362552e-03
## [46] 1.834797e-03 1.472266e-03 1.182732e-03 1.693318e-03 1.428222e-03
## [51] 2.497359e-03 1.323023e-03 1.446560e-03 1.006002e-03 1.528074e-03
## [56] 1.165146e-03 7.784541e-04 1.437173e-03 1.106012e-03 8.102592e-04
## [61] 1.303734e-03 1.084425e-03 1.113735e-03 1.118448e-03 1.292248e-03
## [66] 1.604636e-03 8.816723e-04 9.684137e-04 1.651769e-03 1.155640e-03
## [71] 2.155622e-03 2.264108e-03 2.493754e-03 1.199526e-03 1.725321e-03
## [76] 1.407071e-03 3.198815e-03 1.604636e-03 2.090823e-03 1.598576e-03
## [81] 1.615053e-03 1.123835e-03 1.254502e-03 1.232559e-03 1.278385e-03
## [86] 6.934159e-04 1.800497e-03 1.496229e-03 1.601269e-03 1.088188e-03
## [91] 3.157153e-01 2.550829e-01 1.907362e-01 3.600123e-01 9.818804e-05
## [96] 5.998303e-02 1.709803e-02 2.930078e-01 2.140930e-01 6.755923e-04
##
## $ID_outliers
## [1] 91 92 93 94 96 97 98 99